Note: In all cases, the number of notifications is per month, and the percentages are out of the total number of users who received any notifications, or the total number of notifications received.
The distribution of notifications seems largely to follow a long-tailed power law distribution. Almost all notified users receive very few (about 95% get from 1 to 4), while a very small number of highly-active users account for a large number of notifications (with about 0.1% receiving 10-15% of the total notifications).
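The concentration figure above (a tiny share of users receiving a large share of notifications) can be computed directly from a (notifications, users) table like the ones the queries produce. The following is only a sketch on made-up numbers; the function name `top_user_share`, the column names, and the toy data are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

def top_user_share(dist, user_fraction):
    """Share of all notifications received by the most-notified
    `user_fraction` of users (e.g. 0.001 for the top 0.1%)."""
    ordered = dist.sort_values("notifications", ascending=False)
    total_users = ordered["users"].sum()
    total_notifs = (ordered["notifications"] * ordered["users"]).sum()
    remaining = user_fraction * total_users  # users still to be counted
    taken = 0.0
    for n, users in zip(ordered["notifications"], ordered["users"]):
        take = min(users, remaining)  # may take only part of a bucket
        taken += take * n
        remaining -= take
        if remaining <= 0:
            break
    return taken / total_notifs

# Hypothetical distribution: 1000 users, 2490 notifications in total
toy = pd.DataFrame({
    "notifications": [1, 2, 3, 4, 10, 500],
    "users": [600, 200, 100, 50, 49, 1],
})
print(round(top_user_share(toy, 0.001), 3))  # top 0.1% is the single 500-notification user
```

In this toy table the single most-notified user (0.1% of users) accounts for 500 of 2490 notifications, or about 20%.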
However, the distributions did vary somewhat between the wikis, falling into three main groups:
Users generally don't build up large numbers of unread notifications; for example, out of the 1203 enwiki users who had 30 or more notifications total, only 3.7% had 30 or more unread notifications.
I suggest the following conclusions based on this data:
I chose to look at notification use on five different wikis:
Commons, enwiki, and jawiki represent "normal" large wikis; on the other hand, frwiki and zhwiki are relatively heavy users of Flow (frwiki at its central discussion board and zhwiki on user talk pages). Flow generates a large number of notifications but is not in use at many wikis; in making heavy use of notifications, it represents the likely future direction of MediaWiki software.
The data was generated from the Echo extension's database tables using the following queries:
Total notifications
SELECT
    DATABASE() as "wiki",
    `notifications`,
    COUNT(*) as "users"
FROM
(
    SELECT
        COUNT(*) as "notifications"
    FROM echo_notification
    WHERE
        notification_timestamp > "20160219" AND
        notification_timestamp < "20160321" AND
        notification_bundle_base = 1
    GROUP BY notification_user
) notifications_by_user
GROUP BY `notifications`;
Unread notifications
SELECT
    DATABASE() as "wiki",
    `unread notifications`,
    COUNT(*) as "users"
FROM
(
    SELECT
        SUM( IF( notification_read_timestamp IS NULL, 1, 0) ) as "unread notifications"
    FROM echo_notification
    WHERE
        notification_timestamp > "20160219" AND
        notification_timestamp < "20160321" AND
        notification_bundle_base = 1
    GROUP BY notification_user
) notifications_by_user
GROUP BY `unread notifications`;
I gathered the results from the 5 different wikis using multiquery with the following command:
$ multiquery notifications_per_user.sql --dbnames=notifications_dbs.tsv --host=x1-analytics-slave.eqiad.wmnet --defaults-file=~/.my.cnf > ~/notifications_per_user.tsv
I ran the unread query on 28 March, so the unread counts reflect notifications from the month which were still unread about 7 days after it ended.
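As a quick check of the dates involved, here is a small sketch of the window arithmetic. It assumes that notification_timestamp holds 14-digit MediaWiki-style strings (YYYYMMDDHHMMSS) and that they compare as plain strings, which is how the bounds in the WHERE clauses would then behave; the variable names are mine.

```python
from datetime import date

# Bounds copied from the WHERE clauses of the queries
window_start = date(2016, 2, 19)  # "20160219"
window_end = date(2016, 3, 21)    # "20160321", effectively exclusive
query_run = date(2016, 3, 28)     # day the unread query was run

window_days = (window_end - window_start).days
lag_days = (query_run - window_end).days

# Assumed 14-digit timestamps compared as strings:
# midnight on 19 February is inside the window...
first_instant_included = "20160219000000" > "20160219"
# ...and midnight on 21 March is outside it.
end_instant_included = "20160321000000" < "20160321"

print(window_days, lag_days)                         # 31 7
print(first_instant_included, end_instant_included)  # True False
```

Under that assumption the window covers 31 days, from 19 February through 20 March, and the unread snapshot was taken 7 days after it closed.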
The queries include notification_bundle_base = 1 to exclude "bundled" notifications, which (1) don't behave as standalone notifications from the user's point of view and (2) are never directly marked as read in the database (instead their read status is computed
There are downsides to this approach. The bundled notifications are real notifications (although they're generally less interesting to the user than stand-alone notifications), and omitting them will undercount notification activity. However, the main purpose of this study is to understand whether users are currently overloaded with notifications, so we shouldn't ignore bundling's effect in reducing that load. In addition, the number of unread notifications is an important measure in this study, and the logic used to determine whether a bundled notification has been read is too complex to reimplement here.
Accounting for bundling has a dramatic effect on unread notifications; for example, a previous version of this study found that 19% of enwiki users with at least 25 notifications had at least 25 unread notifications. After excluding bundled notifications, that figure went down to 2%.
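To illustrate why the filter matters, here is a toy pandas version of the two-level aggregation in the queries, run with and without the bundle-base filter. The rows are invented, the column names simply mirror the Echo fields used above, and `users_per_count` is a hypothetical helper, so treat this as a sketch of the logic rather than a reproduction of the real data.

```python
import pandas as pd

# Invented rows: user 1 has one read base notification plus three
# bundled (non-base) rows that are never marked as read directly.
rows = pd.DataFrame({
    "notification_user": [1, 1, 1, 1, 2, 2, 3],
    "notification_bundle_base": [1, 0, 0, 0, 1, 1, 1],
    "notification_read_timestamp": [
        "20160301000000", None, None, None,
        None, "20160305000000", None,
    ],
})

def users_per_count(df, unread_only=False):
    """Inner query: notifications per user; outer query: users per count."""
    per_user = df.groupby("notification_user")["notification_read_timestamp"]
    if unread_only:
        counts = per_user.apply(lambda s: s.isna().sum())
    else:
        counts = per_user.size()
    return counts.value_counts().sort_index()

base_only = rows[rows["notification_bundle_base"] == 1]

unread_all = users_per_count(rows, unread_only=True).to_dict()
unread_base = users_per_count(base_only, unread_only=True).to_dict()
print(unread_all)   # maps unread count -> number of users, no filter
print(unread_base)  # same, counting bundle bases only
```

Without the filter, user 1 appears to have three unread notifications; with it, the same user has none, because only the (read) base row is counted.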
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Jupyter magic: render plots inline in the notebook
%matplotlib inline
In [2]:
notifs = pd.read_table("./notifications_per_user.tsv")
unreads = pd.read_table("./unread_notifications_per_user.tsv")
# Sort so that tables and plots list the wikis in a stable order
wikis = sorted(set(notifs["wiki"]))
notifs.tail()
Out[2]:
In [3]:
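def filter_by_wiki( df, wiki ):
    # Keep one wiki's rows and drop the "wiki" column, leaving (count, users)
    return df[ df["wiki"] == wiki ].iloc[:, 1:3]
def plot_by_wiki( df, wiki, range = (5, 104), bins = 20, ax = None ):
    # pyplot itself has no set_title(), so fall back to the current Axes
    if ax is None:
        ax = plt.gca()
    dist = filter_by_wiki( df, wiki )
    # Column 0 is the notification count; column 1 (users) weights each count
    ax.hist( dist.iloc[:, 0], bins = bins, range = range, weights = dist.iloc[:, 1])
    ax.set_title(wiki)
    ax.set_xlabel( "Number of notifications" )
    ax.set_ylabel( "Users" )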
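The default range = (5, 104) and bins = 20 give bins that are 4.95 wide, which is not obviously aligned with whole notification counts. A quick check with NumPy (a sketch, assuming matplotlib's hist bins the same way as numpy.histogram for these arguments) suggests each bin nevertheless covers exactly five consecutive counts:

```python
import numpy as np

# One "user" at every integer count from 5 to 104 inclusive
counts_5_to_104 = np.arange(5, 105)

per_bin, edges = np.histogram(counts_5_to_104, bins=20, range=(5, 104))
bin_width = edges[1] - edges[0]

print(round(bin_width, 2))  # 4.95
print(per_bin.tolist())     # five integer counts land in every bin
```

So the first bin holds counts 5 to 9, the second 10 to 14, and so on; with these defaults the last bin holds 100 to 104.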
This data covers all the users who received at least one notification during the month, whether they actually visited the site during the month or not, so we'd expect that the numbers are dominated by a large bulk of users with very few notifications, and that there's a long tail of very few users with a very large number of notifications.
But let's characterize that a bit. At each wiki, how many and what percent of users with any notifications got fewer than 5?
In [4]:
def beyond_threshold(df, wikis, threshold, direction):
    # For each wiki, count the users strictly under/over the threshold, plus
    # their share of all notified users and of all notifications
    columns = [
        "wiki",
        "users",
        "% of users",
        "% of notifications"
    ]
    results = []
    for wiki in wikis:
        by_wiki = filter_by_wiki(df, wiki)
        counts = by_wiki.iloc[:, 0]
        users = by_wiki.iloc[:, 1]
        total_users = users.sum()
        # Each row stands for `users` people with `counts` notifications each
        total_notifs = (counts * users).sum()
        if direction == "under":
            selected = by_wiki[ counts < threshold ]
        elif direction == "over":
            selected = by_wiki[ counts > threshold ]
        else:
            raise ValueError('direction must be "under" or "over"')
        users_beyond_threshold = selected.iloc[:, 1].sum()
        notifs_beyond_threshold = (selected.iloc[:, 0] * selected.iloc[:, 1]).sum()
        user_proportion = users_beyond_threshold / total_users
        notifs_proportion = notifs_beyond_threshold / total_notifs
        results.append([
            wiki,
            users_beyond_threshold,
            round(user_proportion * 100, 1),
            round(notifs_proportion * 100, 1)
        ])
    results = pd.DataFrame(results, columns=columns)
    return results
beyond_threshold( notifs, wikis, 5, "under")
Out[4]:
5 or more?
In [5]:
beyond_threshold(notifs, wikis, 4, "over")
Out[5]:
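To make the "% of users" and "% of notifications" columns concrete, here is the same calculation on a small invented distribution. The helper `shares` and the numbers are hypothetical; the point is only that "under 5" and "over 4" split the users into two complementary groups, so the two tables should always add up to 100%.

```python
import pandas as pd

# Invented distribution: 100 users, 250 notifications in total
toy = pd.DataFrame({
    "notifications": [1, 2, 4, 5, 30],
    "users": [50, 30, 10, 8, 2],
})

def shares(dist, mask):
    """Percent of users, and percent of notifications, in the rows selected by mask."""
    weighted = dist["notifications"] * dist["users"]
    user_pct = dist.loc[mask, "users"].sum() / dist["users"].sum() * 100
    notif_pct = weighted[mask].sum() / weighted.sum() * 100
    return round(float(user_pct), 1), round(float(notif_pct), 1)

under_5 = shares(toy, toy["notifications"] < 5)
over_4 = shares(toy, toy["notifications"] > 4)
print(under_5)  # (90.0, 60.0)
print(over_4)   # (10.0, 40.0)
```

Here 90% of the toy users have fewer than 5 notifications but receive only 60% of them, while the remaining 10% of users receive 40%.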
And what percent of users got 25 notifications or more, becoming more or less "daily notified"?
In [6]:
beyond_threshold(notifs, wikis, 24, "over")
Out[6]:
That's lower than I expected at English Wikipedia. It only had about 1,200 users with at least 30 notifications per month, compared to 3,500 highly active users (100+ edits) per month. However, both Flow wikis have higher percentages than the non-Flow wikis.
Now, let's look at the actual distributions. To make it easier to comprehend, I'll cut off the 90%+ of users with fewer than 5 notifications. I'll also cut off the users with 100 or more. How many is that?
In [7]:
beyond_threshold(notifs, wikis, 99, "over")
Out[7]:
In [8]:
fig, axarr = plt.subplots( 5, 1, figsize=(12,30) )
fig.suptitle("Total notifications per user", fontsize=24)
fig.subplots_adjust(top=0.95)
# One histogram per wiki, stacked vertically
for i, wiki in enumerate(wikis):
    plot_by_wiki(notifs, wiki, ax = axarr[i])
So, as expected, all the wikis have a pretty regular power-law distribution of notifications.
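One rough way to go beyond eyeballing the histograms would be to check whether users versus notifications falls on a straight line in log-log space, since a power law users ∝ n^(-k) has slope -k there. The sketch below does this on synthetic data only; `loglog_slope` is a hypothetical helper, and a least-squares fit on logs is a crude estimator that I have not applied to the real tables.

```python
import numpy as np

def loglog_slope(notifications, users):
    """Least-squares slope of log10(users) against log10(notifications)."""
    x = np.log10(np.asarray(notifications, dtype=float))
    y = np.log10(np.asarray(users, dtype=float))
    slope, _intercept = np.polyfit(x, y, 1)
    return slope

# Synthetic, exactly power-law data: users = 1e6 * n^-2
n = np.arange(1, 101, dtype=float)
synthetic_users = 1e6 * n ** -2.0

slope = loglog_slope(n, synthetic_users)
print(round(slope, 3))
```

On the synthetic data the fitted slope recovers the exponent of -2; real data would scatter around the line, especially in the sparse tail.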
In [9]:
beyond_threshold(unreads, wikis, 5, "under")
Out[9]:
In [10]:
beyond_threshold(unreads, wikis, 4, "over")
Out[10]:
In [11]:
beyond_threshold(unreads, wikis, 24, "over")
Out[11]:
In [12]:
beyond_threshold(unreads, wikis, 99, "over")
Out[12]:
In [13]:
fig, axarr = plt.subplots( 5, 1, figsize=(12,30) )
fig.suptitle("Unread notifications per user", fontsize=24)
fig.subplots_adjust(top=0.95)
# One histogram per wiki, stacked vertically
for i, wiki in enumerate(wikis):
    plot_by_wiki(unreads, wiki, ax = axarr[i])